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Self-Exciting  Point  Process  Modeling  of  Crime 

G.  O.  Mohler,  M.  B.  Short,  P.  J.  Brantingham,  F.  P.  Schoenberg,  and  G.  E.  Tita 


Highly  clustered  event  sequences  are  observed  in  certain  types  of  crime  data,  such  as  burglary  and  gang  violence,  due  to  crime-specific 
patterns  of  criminal  behavior.  Similar  clustering  patterns  are  observed  by  seismologists,  as  earthquakes  are  well  known  to  increase  the  risk 
of  subsequent  earthquakes,  or  aftershocks,  near  the  location  of  an  initial  event.  Space-time  clustering  is  modeled  in  seismology  by  self¬ 
exciting  point  processes  and  the  focus  of  this  article  is  to  show  that  these  methods  are  well  suited  for  criminological  applications.  We  first 
review  self-exciting  point  processes  in  the  context  of  seismology.  Next,  using  residential  burglary  data  provided  by  the  Los  Angeles  Police 
Department,  we  illustrate  the  implementation  of  self-exciting  point  process  models  in  the  context  of  urban  crime.  For  this  purpose  we  use  a 
fully  nonparametric  estimation  methodology  to  gain  insight  into  the  form  of  the  space-time  triggering  function  and  temporal  trends  in  the 
background  rate  of  burglary. 

KEY  WORDS:  Crime  hotspot;  Epidemic  Type  Aftershock  Sequences  (ETAS);  Point  process. 


1.  INTRODUCTION 

Criminological  research  has  shown  that  crime  can  spread 
through  local  environments  via  a  contagion-like  process  (John¬ 
son  2008).  For  example,  burglars  will  repeatedly  attack  clusters 
of  nearby  targets  because  local  vulnerabilities  are  well  known  to 
the  offenders  (Bernasco  and  Nieuwbeerta  2005).  A  gang  shoot¬ 
ing  may  incite  waves  of  retaliatory  violence  in  the  local  set 
space  (territory)  of  the  rival  gang  (Tita  and  Ridgeway  2007; 
Cohen  and  Tita  1999).  The  local,  contagious  spread  of  crime 
leads  to  the  formation  of  crime  clusters  in  space  and  time. 

Similarly,  the  occurrence  of  an  earthquake  is  well  known  to 
increase  the  likelihood  of  another  earthquake  nearby  in  space 
and  time.  For  example,  we  plot  in  Figure  1  a  histogram  of  the 
times  between  “nearby  earthquakes,”  pairs  of  earthquake  events 
separated  in  space  by  110  kilometers  or  less,  for  all  recorded 
earthquakes  of  magnitude  3.0  or  greater  in  Southern  California 
during  2004—2005.  The  histogram  shows  a  spike  at  short  times, 
indicating  an  increased  likelihood  of  another  event  in  the  days 
following  each  earthquake.  For  a  stationary  Poisson  process  the 
distribution  of  times  between  pairs  of  events  would  be  approx¬ 
imately  uniform  when  the  length  of  the  time  window  is  much 
larger  than  the  longest  time  bin  of  the  histogram. 

In  the  case  of  residential  burglary,  evidence  indicates  that 
an  elevated  risk  exists  for  both  a  house  that  has  been  recently 
burgled  and  its  neighboring  houses  (Farrell  and  Pease  2001; 
Johnson  et  al.  2007;  Short  et  al.  2009).  To  illustrate  this  point 
further,  we  plot  in  Figure  1  a  histogram  of  the  times  between 
“nearby  burglaries,”  residential  burglaries  separated  in  space  by 
200  meters  or  less,  for  all  recorded  residential  burglaries  within 
an  18  km  by  18  km  region  of  the  San  Fernando  Valley  in  Los 
Angeles  during  2004-2005.  Again  we  observe  a  spike  at  short 
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times,  indicating  an  increased  likelihood  of  victimization  within 
a  few  hundred  meters  and  several  days  of  each  burglary. 

Self-excitation  is  also  found  in  gang  violence  data,  as  an 
event  involving  rival  gangs  can  lead  to  retaliatory  acts  of  vi¬ 
olence.  In  Figure  2,  we  plot  the  times  of  all  recorded  violent 
crimes  between  the  gang  known  as  “Locke  Street”  and  the  ri¬ 
val  gang  known  as  “Lowell  Street”  occurring  between  2000  and 
2002  in  the  Los  Angeles  police  district  of  Hollenbeck.  Here  we 
observe  clear  clustering  patterns  suggestive  of  self-excitation  in 
the  rate  at  which  the  two  rival  gangs  attack  each  other. 

We  propose  that  self-exciting  point  processes  can  be  adapted 
for  the  purpose  of  crime  modeling  and  are  well  suited  to  cap¬ 
ture  the  spatial-temporal  clustering  patterns  observed  in  crime 
data.  More  specifically,  spatial  heterogeneity  in  crime  rates  can 
be  treated  using  background  intensity  estimation  and  the  self¬ 
exciting  effects  detected  in  crime  data  can  be  modeled  with  a 
variety  of  kernels  developed  for  seismological  applications  or 
using  nonparametric  methods.  In  Section  2,  we  review  self¬ 
exciting  point  processes  in  the  context  of  seismological  mod¬ 
eling.  In  Section  3,  we  present  a  model  for  residential  bur¬ 
glary  based  upon  nonparametric  methods  for  Epidemic  Type 
Aftershock-Sequences  models  of  earthquakes.  Our  methodol¬ 
ogy  combines  the  idea  of  stochastic  declustering  with  Kernel 
Density  Estimation  in  a  novel  way.  In  Section  5,  we  compare 
the  predictive  accuracy  of  our  methodology  with  prospective 
crime  hotspot  maps.  The  results  illustrate  how  crime  hotspot 
maps  can  be  improved  using  the  self-exciting  point  process 
framework.  We  validate  the  methodology  with  a  simulated 
point  process  in  the  Appendix. 

2.  SELF-EXCITING  POINT  PROCESS  MODELS 
IN  SEISMOLOGY 

A  space-time  point  process  N(t ,  x,  y )  is  typically  character¬ 
ized  via  its  conditional  intensity  X(t,x,y),  which  may  be  de¬ 
fined  as  the  limiting  expected  rate  of  the  accumulation  of  points 
around  a  particular  spatial-temporal  location,  given  the  history 
Ht  of  all  points  up  to  time  t  (Daley  and  Vere-Jones  2003): 

X(t,  x,  y)  =  lim  (£T(V{ (f,  t  +  At)  x  (x,  x  +  Ax) 
A/,A.v,Ay|0v  L 

x  (y,y  +  Ay)}|7T,]/(AfAxAy)).  (1) 
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1.  On  the  left,  histogram  of  times  (less  than  300  days)  between  Southern  California  earthquake  events  of  magnitude  3.0  or  greater 
by  110  kilometers  or  less.  On  the  right,  histogram  of  times  (less  than  50  days)  between  burglary  events  separated  by  200  meters  or 


In  seismology  a  mark  Mk,  the  magnitude  of  the  earthquake,  is 
associated  with  each  event  (tk,  ay. ,  yk )  and  the  conditional  inten¬ 
sity  often  takes  the  form 

X(t,x,y,M)  =j(M)X(t,x,y),  (2) 

X(t,x,y )  =  p(x,y ) 

+  X!  S(t-tk,x-xk,y-yk\Mk).  (3) 

{k\tk<t) 

Models  of  this  type,  referred  to  as  Epidemic  Type  Aftershock- 
Sequences  (ETAS)  models,  work  by  dividing  earthquakes  into 
two  categories,  background  events  and  aftershock  events.  Back¬ 
ground  events  occur  independently  according  to  a  stationary 
Poisson  process  p.(x,y),  with  magnitudes  distributed  indepen¬ 
dently  of  //  according  to  j(M).  Each  of  these  earthquakes  then 
elevates  the  risk  of  aftershocks  and  the  elevated  risk  spreads  in 
space  and  time  according  to  the  kernel  g(t,  x,  y.  M). 


QExHJj  03000  O  'S®'  OO  O 
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Figure  2.  Times  of  violent  crimes  between  two  rivalry  gangs  in  Los 
Angeles. 


Many  forms  for  g  have  been  proposed  in  the  literature, 
though  in  general  the  kernel  is  chosen  such  that  the  elevated  risk 
increases  with  earthquake  magnitude  and  decreases  in  space 
and  time  away  from  each  event.  For  example,  the  isotropic  ker¬ 
nel. 


g(t,  x,  y;  M) 


Kq 

C t  +  c)P 


ea(M-M0) 

(x2  +y2  +  d)i  ’ 


(4) 


is  one  of  a  variety  of  kernels  reviewed  in  Ogata  (1998).  Here 
Kq,  Mo,  and  a  are  parameters  that  control  the  number  of  after¬ 
shocks,  c  and  d  are  parameters  that  control  the  behavior  of  the 
kernel  at  the  origin,  and  p  and  q  are  parameters  that  give  the 
(power  law)  rate  of  decay  of  g. 

Standard  models  for  the  background  intensity  /i  (x,  y)  include 
spline,  kernel  smoothing,  and  Voronoi  estimation  (Silverman 
1986;  Ogata  and  Katsura  1988;  Okabe  et  al.  2000).  In  the  case 
of  fixed  bandwidth  kernel  smoothing,  the  background  intensity 
is  estimated  by 


ix(x,y)  =  ix-'^2u(x-xk,y-yk;(r),  (5) 

k 

where  JZ  is  a  parameter  controlling  the  overall  background  rate. 
The  events  (tk,xk,  yk,Mk)  are  assumed  to  be  background  events 
and  in  practice  can  be  obtained  through  a  declustering  algo¬ 
rithm  (Zhuang,  Ogata,  and  Vere -Jones  2002). 

The  appropriate  selection  of  parameter  values  is  as  critical 
to  the  modeling  process  as  specifying  accurate  forms  for  /x, 
g,  and  j.  The  distance  in  space  and  time  over  which  the  risk 
spreads,  the  percentage  of  background  events  vs.  aftershocks, 
the  dependence  of  the  increased  risk  on  magnitude  size,  etc., 
all  can  have  a  great  impact  on  the  predictive  power  of  a  point 
process  model.  Parameter  selection  for  ETAS  models  is  most 
commonly  accomplished  through  maximum  likelihood  estima¬ 
tion,  where  the  log-likelihood  function  (Daley  and  Vere-Jones 
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2003), 

K0)  =  ^2\og{k(tk,xk,yk-,  0)} 
k 


■  /  /  /  X(t,x,y\9)dydxdt, 

Jo  J  Js 


(6) 


is  maximized  over  all  parameter  sets  6.  Here  S  x  [0,  T ]  is  the 
space-time  window  of  observation. 

More  recently,  nonparametric  methods  have  been  intro¬ 
duced  for  self-exciting  point  process  estimation  (Zhuang  2006; 
Marsan  and  Lenglin  2008).  Consider  space-time  point  data 
{(4-,  Xk,yk))k=i  ar,d  a  general  self-exciting  point  process  model 
of  the  form 


X(t,x,y)  =  p(t,x,y)  +  ^2  g(t-tk,x-xk,y-yk)-  (7) 

( k:tk<r } 


Assuming  model  correctness,  the  probability  that  event  i  is  a 
background  event,  pa,  is  given  by 


Pii  = 


fi(ti,Xi,yj ) 


(8) 


X(tj,xi,yi ) 

and  the  probability  that  event  j  triggered  event  i,  pjj,  is  given  by 


Pji  = 


g(ti 


fj,  Xj  -  xj,  yi  -  yj) 
X{ti,Xi,yi ) 


(9) 


(Zhuang,  Ogata,  and  Vere-Jones  2002).  Let  P  denote  the  matrix 
with  entries  pj\  (note  that  the  columns  sum  to  one).  Then  sto¬ 
chastic  declustering  can  be  used  in  the  following  way.  Given 
an  initial  guess  Pq  of  the  matrix  P,  we  then  have  N(N  + 
l)/2  probabilistic  data  points  {(tk,Xk,yk,Pkk)}k=i  and  {(f;  — 
tj,Xj  —  xj,  }’j  —  yj,pji)}i>j.  Given  this  data,  a  nonparametric 
density  estimation  procedure  can  be  used  to  estimate  p  from 
{(tk,Xk,yk,Pkk)}k=i  and  g  from  {(/,-  -  tj,xt  -xj,yi ~yj,Pji)}i>j, 
providing  estimates  po  and  g q.  We  can  then  proceed  iteratively 
as  follows  until  convergence  is  achieved: 


Step  1 .  Estimate  pn  and  g„  from  P„_  i . 

Step  2.  Update  Pn  from  pn  and  gn  using  (8)  and  (9). 


For  example,  a  simple  histogram  estimator  is  used  in  Marsan 
and  Lenglin  (2008)  in  step  1. 


3.  A  SELF-EXCITING  POINT  PROCESS 
MODEL  OF  BURGLARY 


For  the  purpose  of  modeling  burglary  we  consider  an  un¬ 
marked  self-exciting  model  for  the  conditional  intensity  of  the 
form 

\{t,x,y)  =  v{t)p(x,y)  +  ^2  8(t-tk,x-xk,y-yk)-  (10) 

[k:tk<t] 

Here  we  neglect  spatially  localized  temporal  fluctuations  in  the 
background  rate  and  assume  that  the  fluctuations  occur  globally 
(e.g.,  due  to  weather,  seasonality,  time  of  day,  etc.)  In  the  case 
of  seismology,  research  over  a  number  of  decades  was  needed 
to  refine  the  (parametric)  form  of  the  triggering  function  g.  For 
this  reason,  nonparametric  methods  are  appealing  in  the  con¬ 
text  of  crime  in  order  to  quickly  gain  insight  into  the  forms  of 
v,  p,  and  g.  For  this  purpose  we  use  the  iterative  procedure  out¬ 
lined  in  the  previous  section  to  estimate  the  model,  with  several 
modifications. 


Because  the  probabilistic  data  {(4, *k,yk, Pkk))k=i  and  Kh  ~ 
tj,  x,  —Xj, yi-yj,Pji)}i>j  is  both  three-dimensional  and  the  num¬ 
ber  of  data  points  is  0(N2)  [where  N  is  typically  (9(1000)  for 
earthquake  and  crime  datasets],  the  estimation  step  for  p  and 
g  is  computationally  expensive.  The  dimensionality  prevents 
straightforward  use  of  binning  methods  such  as  the  Average 
Shifted  Histogram  (Marsan  and  Lenglin  use  a  logarithmically 
scaled  histogram  on  a  coarse  grid),  as  many  bins  may  have  ex¬ 
tremely  small,  but  nonzero,  values  (since  the  data  is  probabilis¬ 
tic,  the  count  in  each  bin  can  be  less  than  1).  Alternatively,  the 
large  size  of  the  data  set  prevents  efficient  use  of  off-grid  meth¬ 
ods  such  as  Kernel  Density  Estimation.  To  get  around  these  is¬ 
sues  we  use  the  following  Monte  Carlo-based  iterative  proce¬ 
dure: 

Step  1.  Sample  background  events  {(f* ,  x^ ,  yf)}^  and  off- 

spring/parent  interpoint  distances  {(t°,x°,y°) fromH,,-]. 

Step  2.  Estimate  v„,  p„  and  gn  from  the  sampled  data  using 
variable  bandwidth  Kernel  Density  Estimation. 

Step  3.  Update  Pn  from  vn,  pn  and  gn  using  (8)  and  (9). 

Because  Nb  +  N0  =  N,  the  size  of  the  sampled  data  at  each 
iteration  allows  for  the  use  of  Kernel  Density  Estimation.  An¬ 
other  issue  is  that  the  number  of  background  and  offspring 
events,  Nb  and  N()1  is  changing  at  each  iteration.  Thus  a  fixed 
bandwidth  for  any  density  estimation  technique  (kernel  smooth¬ 
ing,  histogram,  etc.)  will  over-smooth  at  some  iterations  and 
under-smooth  at  others.  Therefore  we  employ  variable  band¬ 
width  KDE  (alternatively  Cross  Validation  could  be  used).  We 
give  further  details  of  our  approach  and  provide  validation  us¬ 
ing  a  simulated  point  process  in  the  Appendix. 

4.  RESULTS 

We  fit  the  model  given  by  Equation  (10)  to  a  dataset  collected 
by  the  Los  Angeles  Police  Department  of  5376  reported  resi¬ 
dential  burglaries  in  an  18  km  by  18  km  region  of  the  San  Fer¬ 
nando  Valley  in  Los  Angeles  occurring  during  the  years  2004 
and  2005.  Each  burglary  is  associated  with  a  reported  time  win¬ 
dow  over  which  it  could  have  occurred,  often  a  few  hour  span 
(for  instance,  the  time  span  over  which  a  victim  was  at  work), 
and  we  define  the  time  of  burglary  to  be  the  midpoint  of  each 
burglary  window. 

In  Figure  3,  we  plot  the  sampled  interpoint  distances  {(f°,  x" , 
f!/,)}/=i  on  t^le  75th  iteration  of  the  stochastic  declustering  al¬ 
gorithm  (see  Appendix  for  convergence  verification).  The  num¬ 
ber  of  sampled  (offspring)  events  is  706  (13.1%  of  all  events) 
and  of  these  events  approximately  63%  are  exact  repeats  (oc¬ 
curring  at  the  same  house).  On  the  left,  the  spatial  interpoint 
distances  are  plotted  showing  that  elevated  crime  risk  travels 
around  50  m-100  m  from  the  house  of  an  initial  burglary  to 
the  location  of  direct  offspring  events.  As  discussed  in  Marsan 
and  Lenglin  (2008),  the  overall  distance  near-repeat  risk  travels 
is  several  times  further  due  to  the  cascading  property  of  self¬ 
exciting  point  processes.  Note  also  that  the  risk  travels  verti¬ 
cally  and  horizontally  (along  streets),  more  so  than  it  does  in 
other  directions.  On  the  right,  we  plot  the  spatial  (v-coordinate) 
interpoint  distances  against  the  time  interpoint  distances.  Here 
exact-repeat  burglaries,  those  occurring  at  the  same  house,  are 
apparent  along  the  x-axis. 
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Figure  3.  Spatial  (left)  and  space-time  (right)  offspring/parent  interpoint  distances  {(t° ,  xP,  y?)}^  sampled  from  P-j 5. 


In  Figure  4,  we  plot  (on  a  logarithmic  scale)  the  estimated 
marginals 


£75  (?)  = 


£75  (f,  x,  y)  dxdy 


and 


£75  W  = 


£75  0,*,  y)dydt 


computed  from  the  KDE  estimate  of  g  at  the  75th  iteration.  Here 
the  presence  of  exact-repeat  events  can  again  be  seen,  as  £75  (x) 
appears  to  approximate  a  delta  distribution  at  the  origin.  The 
spike  around  1-2  days  in  the  plot  of  £75  (f)  is  due  to  the  pres¬ 
ence  of  fast  “crime  sprees,”  where  most  likely  the  same  burglar 
visited  several  neighboring  houses  within  a  time  span  of  a  few 
minutes  to  several  days.  There  are  also  several  bumps  in  the  el¬ 
evated  risk  of  burglary,  for  example,  around  7  days.  Here  one 


possible  explanation  is  that  the  routine  of  the  burglar  and/or 
the  victim  is  such  that  a  particular  day  of  the  week  is  a  prefer¬ 
able  time  to  commit  the  burglary.  After  7-10  days,  the  elevated 
risk  of  repeat/near-repeat  victimization  drops  to  an  intermedi¬ 
ate  level  and  stays  relatively  flat  for  a  time  span  on  the  order 
of  several  hundred  days  before  decaying  back  to  baseline  rates. 
These  results  are  consistent  with  previous  quantitative  studies 
of  exact-repeat  burglaries  (Short  et  al.  2009). 

In  Johnson  (2008),  the  authors  discuss  the  need  to  model 
risk  heterogeneity  and  in  general  it  is  a  difficult  task  to  sepa¬ 
rate  clustering  due  to  background  heterogeneity  and  clustering 
due  to  self-excitation.  One  benefit  of  using  the  nonparametric 
approach  outlined  above  is  that  temporal  and  spatial  changes 
in  the  rate  of  crime  are  automatically  separated  into  those  stem¬ 
ming  from  exogenous  effects  and  those  due  to  self-excitation.  In 
Figure  5,  we  plot  the  estimated  marginals  1775(f)  and  /X7 s(x,y) 
estimated  using  KDE  from  {(f(fc,  x1-,  y f)}^x  at  the  75th  iteration. 


x-coordinate  (km) 


Figure  4.  Marginal  £75  (f)  (left)  and  marginal  £75(71:)  (right)  estimated  using  KDE  based  upon  offspring/parent  interpoint  distances  sampled 
from  P75 . 
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Figure  5.  Background  rate  time  marginal  V75  (t)  (left)  and  space  marginal  /T75  (x,  y)  (right)  estimated  using  KDE  from  the  background  events 
sampled  from  P75 . 


Here  the  estimated  background  rate  exhibits  temporal  fluctua¬ 
tions  on  a  time  scale  of  months/days,  separate  from  the  fluctua¬ 
tions  due  to  self-excitation.  These  fluctuations  are  likely  caused 
by  a  number  of  factors  such  as  seasonal,  economic,  and  demo¬ 
graphic  changes,  as  well  as  temporal  variations  in  burglar  rou¬ 
tine  activities  (Felson  1998).  For  example,  residential  burglary 
tends  to  have  a  higher  weekday  rate  (when  victims  are  at  work) 
compared  to  weekends. 

Similarly,  the  background  rate  is  also  spatially  variable, 
which  is  consistent  with  fixed  environmental  heterogeneity  in 
crime  opportunities,  as  well  as  variability  in  population  density 
through  space  (Bernasco  and  Nieuwbeerta  2005).  In  seismol¬ 
ogy,  declustered  catalogs  are  of  great  interest  as  they  can  be 
used  in  estimating  the  background  rate  of  major  earthquakes. 
Declustered  crime  catalogs  could  potentially  be  used  by  police 
to  distinguish  between  areas  of  a  city  with  intrinsically  high 
crime  rates  and  areas  with  temporarily  high  crime  rates  (due  to 
near-repeat  effects).  As  the  former  arises  due  to  structural  prop¬ 
erties  of  a  given  neighborhood  and  the  latter  from  behavioral 
characteristics  of  individual  burglars,  police  and  community  re¬ 
sponses  would  likely  need  to  be  different  in  each  case. 

5.  CRIME  FORECASTING:  POINT  PROCESSES 
VERSUS  HOTSPOT  MAPS 

Crime  hotspot  maps  are  a  well-established  tool  for  visualiza¬ 
tion  of  space-time  crime  patterns  and  can  be  used  as  a  method 
for  prediction  of  near- repeat  crimes.  Given  space-time  crime 
observations  (4,  xy,  yG,  crime  hotspot  maps  are  generated  for  a 
time  interval  [f  —  T,  t ]  by  overlaying  a  density  plot  of  the  func¬ 
tion, 

A .(t,x,y)=  ^2  g(t-tk,x-Xk,y-yk),  (11) 

t—T<tic<t 

onto  a  city  map,  where  g(t,  x.  y )  is  a  space-time  kernel.  By  flag¬ 
ging  the  areas  of  the  city  where  X  takes  on  its  highest  values, 
crime  hotspot  maps  can  be  used  to  indicate  which  areas  in  the 


city  are  likely  to  contain  future  crimes  (Bowers,  Johnson,  and 
Pease  2004;  Chainey,  Tompson,  and  Uhlig  2008). 

For  example,  in  Bowers,  Johnson,  and  Pease  (2004)  a  coarse 
grained  kernel  is  used  that  decays  inversely  proportional  to  spa¬ 
tial  and  temporal  distance.  In  particular,  with  spatial  distance 
d  in  units  of  1  /2  cell  widths  and  time  t  in  units  of  weeks,  the 
kernel  in  (1 1)  is  specified  as 


g(t,d) 


1 

(l+0(l+d) 


(12) 


on  the  domain  (t,  d)  e  [0,  2  months]  x  [0,  400  meters]  and  0 
otherwise.  Such  a  crime  hotspot  map  is  referred  to  as  “prospec¬ 
tive,”  as  it  uses  past  crimes,  coupled  with  the  contagious  spread 
of  crime  (modeled  by  g),  to  estimate  future  relative  crime  risk 
across  the  city.  It  should  be  noted  that  the  risk  is  relative  because 
(1 1)  is  not  a  point  process  intensity. 

Here  we  compare  the  predictive  accuracy  of  the  self-exciting 
point  process  model  developed  in  Section  3  to  the  prospective 
crime  hotspot  map  given  by  (1 1)— (12).  Because  crime  is  local¬ 
ized  in  small  regions  of  the  city  (a  commercial  zone  with  no  res¬ 
idential  burglary  may  be  located  100  meters  from  a  neighbor¬ 
hood),  we  find  that  for  predictive  purposes  variable  bandwidth 
KDE  is  less  accurate  than  fixed  bandwidth  KDE.  We  there¬ 
fore  estimate  /r(x, y)  in  Equation  (10)  using  fixed  bandwidth 
Gaussian  KDE,  with  20-fold  cross  validation  used  to  select  the 
bandwidth  (Silverman  1986). 

For  every  day  k  of  2005,  each  model  assesses  the  risk  of  bur¬ 
glary  within  each  of  M 2  cells  partitioning  an  18  km  by  18  km 
region  of  the  San  Fernando  Valley  in  Los  Angeles.  Based  on  the 
data  from  the  beginning  of  2004  up  through  day  k ,  the  N  cells 
with  the  highest  risk  (value  of  X)  are  flagged  yielding  a  predic¬ 
tion  for  day  k  +  1 .  The  percentage  of  crimes  falling  within  the 
flagged  cells  on  day  k  +  1  is  then  recorded  and  used  to  measure 
the  accuracy  of  each  model. 

In  Figure  6,  on  the  left  we  plot  the  percentage  of  crimes 
predicted  averaged  over  the  forecasting  year  against  the  per¬ 
centage  of  flagged  cells  for  the  self-exciting  point  process  and 
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Figure  6.  Forecasting  strategy  comparison.  Average  daily  percentage  of  crimes  predicted  plotted  against  percentage  of  cells  flagged  for  2005 
burglary  using  200  m  by  200  m  cells.  Error  bars  correspond  to  the  standard  error.  Prospective  hotspot  cutoff  parameters  are  400  meters  and  8 
weeks  (left)  and  optimal  parameters  (right)  are  200  meters  and  39  weeks.  Spatial  background  intensity  smoothing  bandwidth  for  the 

point  process  is  300  meters  (left)  selected  by  cross  validation  and  130  meters  (right)  selected  to  optimize  the  number  of  crimes  predicted. 


the  prospective  hotspot  strategy.  For  example,  with  10%  of 
the  city  flagged  the  point  process  and  prospective  hotspot  cor¬ 
rectly  predict  660  and  547  crimes  (respectively)  out  of  2627. 
The  difference  in  accuracy  between  the  two  methodologies  can 
be  attributed  to  the  crime  hotspot  map’s  failure  to  account  for 
the  background  rate  of  crime.  While  prospective  crime  hotspot 
maps  used  for  crime  prediction  attempt  to  quantify  the  conta¬ 
gious  spread  of  crime  following  past  events,  they  fail  to  assess 
the  likelihood  of  future  “background”  events,  the  initial  events 
that  trigger  crime  clusters. 

In  order  to  disentangle  the  dependence  of  model  accuracy  on 
parameter  selection,  in  Figure  6  on  the  right  we  repeat  the  same 
prediction  exercise  but  with  parameters  of  each  model  selected 
to  yield  the  highest  number  of  crimes  predicted  (L1  norm  over 
1  through  15%  of  cells  flagged).  The  optimal  cutoff  parameters 
for  the  prospective  hotspot  map  are  200  meters  and  39  weeks. 
With  these  parameter  values,  in  particular  the  slow  decay  of  g 
in  time.  Equation  (11)  is  closer  to  Poisson  estimation.  For  the 
point  process  model  we  only  optimize  the  bandwidth  used  for 
/ i(x,y )  as  the  computational  cost  of  the  stochastic  decluster¬ 
ing  algorithm  is  relatively  high.  Whereas  the  bandwidth  is  esti¬ 
mated  to  be  approximately  300  meters  using  cross  validation,  a 
smaller  bandwidth,  130  meters,  provides  a  higher  level  of  pre¬ 
dictive  accuracy.  This  can  be  attributed  to  the  spatially  localized 
features  of  neighborhoods,  and  hence  burglary. 

For  all  percentages  of  cells  flagged  the  prospective  hotspot 
map  underperforms  the  point  process,  though  for  certain  per¬ 
centages  the  relative  underperformance  is  less.  On  the  left  in 
Figure  6,  the  prospective  hotspot  map  performs  better  (relative 
to  the  point  process)  for  smaller  percentages  of  cells  flagged,  as 
the  parameters  are  selected  to  account  for  near-repeat  effects. 
On  the  right,  the  prospective  hotspot  map  performs  better  for 
larger  percentages  of  flagged  cells,  since  for  these  parameter 
values  the  model  is  more  accurately  estimating  fixed  environ¬ 
mental  heterogeneity.  For  crime  types  such  as  robbery  and  auto 


theft,  where  near-repeat  effects  play  less  of  a  role,  prospective 
hotspot  maps  tailored  for  near-repeat  effects  are  likely  to  be 
outperformed  by  simple  Poisson  estimation.  The  advantage  of 
models  of  the  form  (10)  is  that  the  balance  between  exogenous 
and  endogenous  contributions  to  crime  rates  is  inferred  from 
the  data  as  opposed  to  being  imposed  a  priori. 

6.  DISCUSSION 

We  showed  how  self-exciting  point  processes  from  seismol¬ 
ogy  can  be  used  for  the  purpose  of  crime  modeling.  In  the  fu¬ 
ture  it  may  be  desirable  to  tailor  point  process  models  specifi¬ 
cally  for  crime,  taking  into  account  the  crime  type  and  the  lo¬ 
cal  geography  of  the  city.  Based  upon  the  insights  provided  by 
nonparametric  estimates,  parametric  models  can  be  constructed 
that  have  advantages  with  respect  to  model  fitting  and  simula¬ 
tion.  Background  rates  can  also  be  improved  by  incorporating 
other  data  types  (in  Johnson  2008,  housing  density  is  used  to 
improve  models  of  repeat  victimization).  In  the  case  of  gang 
violence,  a  hybrid  network-point  process  approach  may  be  use¬ 
ful  for  capturing  the  self-exciting  effects  stemming  from  gang 
retaliations.  Here  increased  risk  may  not  diffuse  in  geographic 
space,  but  instead  may  travel  through  the  network  space  of  gang 
rivalry  relations. 

The  methodology  used  in  this  study  can  be  implemented  for 
other  applications  as  well,  for  example  refining  point  process 
models  of  earthquakes.  It  could  potentially  be  adapted,  more 
generally,  to  other  second-order  models  of  point  processes. 
The  stochastic  declustering  algorithm  opens  up  the  door  to 
a  plethora  of  density  estimation  techniques  (Silverman  1986; 
Scott  1992;  Eggermont  and  LaRiccia  2001)  that  could  be  used 
to  explore  point  processes  in  a  way  parametric  methods  do  not 
allow. 

In  Marsan  and  Lenglin  (2010)  it  is  shown  that  the  method 
is  an  Expectation-Maximization  (EM)  type  algorithm.  At  the 
maximization  step  the  complete  data  log-likelihood  function 
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decouples  in  terms  of  the  background  and  triggering  functions, 
which  is  why  at  each  iteration  the  problem  reduces  to  several 
decoupled  density  estimation  problems.  Several  issues  could 
potentially  arise  here,  one  being  that  the  method  could  con¬ 
verge  to  a  local  (but  not  global)  minimum  of  the  observed  data 
log-likelihood  function.  Another,  as  pointed  out  in  Sornette  and 
Utkin  (2009),  is  that  the  sample  size  and  domain  size  (relative 
to  the  support  of  the  triggering  kernel)  play  a  key  role  in  the 
accuracy  of  stochastic  declustering.  In  numerical  tests  we  have 
found  that  at  least  (9(1000)  data  points  are  needed  in  three  di¬ 
mensions  for  the  iterates  to  converge  to  the  right  densities  and 
the  domain  needs  to  be  several  times  larger  than  the  support 
of  the  triggering  kernel.  Similar  to  analytical  results  for  stan¬ 
dard  density  estimation,  it  would  be  useful  to  have  convergence 
results  relating  sample  size,  branching  ratio,  domain  size,  and 
the  bandwidth  of  the  density  estimators  to  the  solution  of  the 
fixed-point  iteration. 


APPENDIX 

Given  point  data  (ft,  xk,  yk)k—i  and  a  self-exciting  point  process 
model  of  the  form, 

k(t,x,y)  =  v(t)n(x,y)+  ^  g(t  -  tk,x  - xk,y  -  yk),  (A.l) 

jfc  tk<t] 

we  iterate  the  following  until  convergence: 

Step  1.  Sample  background  events  {(tf ,  rf ,  yf)}^i  and  offspring/ 
parent  interpoint  distances  {(t?,  x?,  y?)}^? j  from  Pn_  \ . 

Step  2.  Estimate  vn,  ptn,  and  gn  from  the  sampled  data. 

Step  3.  Update  Pn  from  vn,  p,n,  and  gn  using  (8)  and  (9). 

In  order  to  estimate  v„,  //„,  and  gn  from  the  sampled  data,  we  use 
variable  bandwidth  Kernel  Density  Estimation.  To  estimate  g„,  we  first 
scale  the  data  {(t° ,  x°,  _y?)l'^f|  to  have  unit  variance  in  each  coordinate 
and  based  upon  the  rescaled  data  compute  Dj,  the  kth  nearest  neighbor 
distance  (three-dimensional  Euclidean  distance)  to  data  point  i.  We 
then  transform  the  data  back  to  its  original  scale  and,  letting  ax,  cry. 


40  80 

#  Iterations 


Figure  A.l.  L2  error  \\Pn  —  Pn-\ || 2  (top  left)  and  Nk,  the  number  of  sampled  background  events,  (top  right)  at  the  /7th  iteration  for  known 
point  process  model.  L2  error  \\Pn  —  Pn_  \  || 2  (bottom  left)  and  A^,  the  number  of  sampled  background  events,  (bottom  right)  at  the  nth  iteration 
for  the  method  applied  to  the  5376  burglary  events  in  Section  3. 
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and  tjf  be  the  sample  standard  deviation  of  each  coordinate,  estimate 
the  triggering  function  as 


1 

axOy0t  (2tt) 


/  (x-x?)2  (y-y?)2  (t  -  ff  )2  \ 

X6XPl  2 a}D2  2ffyD2  2a2D2  ) 

The  background  rate  is  estimated  similarly,  where  one-dimensional 
and  two-dimenional  Gaussian  kernels  are  used  to  estimate  vn  and  jin, 
respectively.  In  Zhuang,  Ogata,  and  Vere-Jones  (2002),  the  authors  rec¬ 
ommending  using  the  10th- 100th  nearest  neighbor  distance  for  Dj. 
Throughout  we  compute  Dj  corresponding  to  v  using  the  100th  near¬ 
est  neighbor  distance  and  in  higher  dimensions  we  use  the  15th  nearest 
neighbor  distance  for  Dj  corresponding  to  fj.  and  g. 

We  validate  the  method  by  simulating  (A.l)  with 

M  /  x2  \  /  y2  \ 

v(t)fi(x,y)  = - exp - ^  exp - - - T 

(27r )  (4.5)2  2(4.5)2/  2(4. 5)2  / 

and 


g(t,x,y)  =  @wexp(— cot)  exp  I  — 


and  comparing  the  estimates  supplied  by  the  method  with  the  known 
distribution.  The  simulation  was  carried  out  by  first  simulating  all 
background  events  according  to  the  Poisson  process  vji.  The  rest  of  the 
simulation  was  carried  out  iteratively,  where  each  point  of  each  gen¬ 
eration  generates  its  own  offspring  according  to  the  Poisson  process 
g  centered  at  the  parent  point.  The  process  terminates  at  the  nth  gen¬ 
eration  when  all  events  of  the  nth  generation  lie  outside  of  the  time 
window  under  consideration.  In  order  to  have  a  realization  of  the  point 
process  at  steady  state,  the  first  and  last  2000  points  were  disregarded 
in  each  simulation. 

In  Figure  A.l,  we  plot  the  L2  error  \\Pn  —  Pn-\  || 2  at  the  nth  it¬ 
eration  and  the  number  of  sampled  background  events  Nj,  at  the  /7th 
iteration  against  the  true  number  of  background  events  for  one  real¬ 
ization  of  the  known  point  process.  Here  we  observe  that  the  error 
converges  quickly  for  the  first  10  iterations  and  then  stabilizes  as  the 
error  introduced  by  estimating  the  point  process  through  sampling  P 
cannot  be  reduced  further  (unless  a  deterministic  iterative  procedure 
is  employed).  We  also  verify  that  the  method  applied  to  the  5376  bur¬ 
glary  events  in  Section  3  reached  convergence  in  Figure  A.l.  Here  we 
observe  a  similar  rate  of  convergence  for  the  crime  data  as  with  the 
simulated  point  process. 

In  Table  A.l,  we  list  the  exact  parameter  values  used  for  the  simu¬ 
lated  point  process  and  the  estimates  averaged  over  the  final  10  itera¬ 
tions  of  the  stochastic  declustering  algorithm  for  each  of  five  simula¬ 
tions  of  the  point  process.  The  parameter  values  were  selected  to  yield 
point  patterns  with  scales  similar  to  those  observed  in  crime  data.  The 
parameter  estimates  are  computed  using  the  sample  variances  of  the 
coordinates  of  {(t°,x°,  y?)}^j  and  the  values  of  Nj,,  Na.  As  some  er¬ 
ror  is  due  to  sample  variation,  we  plot  in  the  last  two  columns  the 


Table  A.  1 .  Parameter  value  estimates 


w-1 

ox 

°y 

e 

M 

Nj,  est. 

Nj,  true 

True  values 

10.00 

0.0100 

0.1000 

0.2000 

5.7100 

Run  1  est. 

11.08 

0.0176 

0.1433 

0.2001 

5.6921 

3999.7 

4041 

Run  2  est. 

12.20 

0.0156 

0.1296 

0.1967 

5.7768 

4016.5 

4026 

Run  3  est. 

11.76 

0.0150 

0.1295 

0.1997 

5.6711 

4001.5 

4017 

Run  4  est. 

13.30 

0.0135 

0.1407 

0.2049 

5.6185 

3975.3 

4015 

Run  5  est. 

11.27 

0.0147 

0.1317 

0.2102 

5.7652 

3948.9 

3977 

estimated  number  of  background  events  versus  the  actual  number  of 
background  events  in  each  of  the  five  simulations  to  assess  the  abil¬ 
ity  of  the  method  to  reconstruct  the  realized  branching  structure.  In 
Figure  A.2,  we  plot  the  estimated  marginals  of  g(f,  x,  y)  against  the 


Figure  A.2.  Estimated  (circles)  and  actual  (solid  line)  marginals  of 
g(f,  x,  y)  on  the  75th  iteration.  Top  is  the  marginal  g(x),  in  the  middle 
is  the  marginal  giy),  and  lower  figure  is  marginal  g(t). 
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actual  distributions  on  the  75th  iteration  of  the  stochastic  declustering 
algorithm.  The  estimated  time  marginal  density  deviates  from  the  true 
density  at  the  origin  due  to  the  jump  discontinuity  of  the  exponential 
distribution.  However,  the  estimate  of  the  parameter  co  is  still  close  to 
the  true  value  (see  Table  A.l). 

[Received  September  2009.  Revised  October  2010.] 
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